[SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s by ifilonenko · Pull Request #21092 · apache/spark

ifilonenko · 2018-04-18T05:29:12Z

What changes were proposed in this pull request?

Introducing Python Bindings for PySpark.

- Running PySpark Jobs
- Increased Default Memory Overhead value
- Dependency Management for virtualenv/conda

How was this patch tested?

This patch was tested with

- Unit Tests
- Integration tests with this addition

KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run SparkPi with a test secret mounted into the driver and executor pods
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
Run completed in 4 minutes, 28 seconds.
Total number of tests run: 11
Suites: completed 2, aborted 0
Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

Initial architecture for PySpark w/o dependency management

SparkQA · 2018-04-18T05:43:45Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2365/

SparkQA · 2018-04-18T05:55:04Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2365/

erikerlandson · 2018-04-18T15:29:17Z

Thanks @ifilonenko !
I'm interested in figuring out what it means for the container images to be "python 2/3 generic" - does that imply being able to run either, based on submit parameters?

foxish · 2018-04-18T17:00:40Z

cc @holdenk

foxish · 2018-04-18T17:59:35Z

+      "$SPARK_HOME/bin/spark-submit"
+      --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+      --deploy-mode client
+      "$@" $PYSPARK_PRIMARY $PYSPARK_SECONDARY


Can we have more descriptive names for PYSPARK_PRIMARY and PYSPARK_SECONDARY? Maybe PYSPARK_MAINAPP and PYSPARK_ARGS?

foxish · 2018-04-18T18:03:34Z

+    rm -r /usr/lib/python*/ensurepip && \
+    pip install --upgrade pip setuptools && \
+    rm -r /root/.cache
+ENV PYTHON_VERSION 2.7.13


If we set this, are we implicitly imposing a contract on the base image to have this particular version of python installed?

That is what I brought up in the PR description, and why this is still a WIP. I need to investigate the proper way to determine whether we ship these containers with Python2 or Python3.

In some OSes, python and python3 are symlinks to the installed 2.x and 3.x interpreters, respectively. Is that a better approach than hardcoding the version number?

So I think it might make sense to build the container with both 2 & 3 since the container might be built by a vendor or cluster administrator and then used by a variety of people. What do folks think?

As for figuring out the env, if we wanted to do it that way we can call the current user's python and ask it for its version information (based on the Spark Python environment variables).

I think a canonical container should include both. My instinct is that a user should be able to "force" the use of one or the other. If someone is invoking spark-submit in cluster-mode, with a supplied python file, some kind of CLI argument (--conf or otherwise) seems like the only totally foolproof way to identify that for the eventual pod construction, but maybe there is a better way?

perhaps re-use PYSPARK_PYTHON?
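
The idea of asking the configured interpreter for its version could be sketched roughly as follows. This is only an illustration, not what the PR implements: the helper name python_major_version is hypothetical, and it assumes the interpreter is named by the PYSPARK_PYTHON environment variable, falling back to plain python.

```python
import os
import subprocess
import sys


def python_major_version(executable=None):
    """Ask an interpreter for its major version (hypothetical helper)."""
    # Assumption: PYSPARK_PYTHON names the interpreter, else plain "python".
    exe = executable or os.environ.get("PYSPARK_PYTHON", "python")
    out = subprocess.check_output(
        [exe, "-c", "import sys; print(sys.version_info[0])"]
    )
    return int(out.decode().strip())


if __name__ == "__main__":
    # Query the interpreter that is running this script.
    print(python_major_version(sys.executable))
```

A submit-time flag would still be needed in cluster mode, since the interpreter that matters there lives inside the container image rather than on the submitting machine.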

foxish · 2018-04-18T18:09:51Z

  }

+  test("Apply Python step if main resource is python.") {
+    val conf = KubernetesConf(


Unrelated to this PR, but @mccheah, should we have something like the fluent/builder pattern here for KubernetesConf, since it's grown to quite a few params? I'm happy to take a stab at it if we agree that's a good direction.

Depends on what we want to use the builder for. One advantage of a builder is handled by case classes already: the fact that you don't have to order arguments in a particular way; you can get around this by using named parameters when you construct the object. But, if you want to stage the construction of the object in multiple calls, then a builder will get you that while a case class by itself will not.

I think it would be neater to have a builder. The SparkSession Builder is an example from the project we can follow.
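
The trade-off described above can be illustrated in Python, used here purely as a stand-in for the Scala code. DriverConf and DriverConfBuilder are hypothetical names, not the real KubernetesConf API: keyword arguments already remove the ordering burden, while a builder additionally lets construction be staged over several calls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DriverConf:
    # Hypothetical stand-in for a configuration object with many parameters.
    app_name: str
    main_class: Optional[str] = None
    py_files: Optional[str] = None


class DriverConfBuilder:
    """Collects settings across several calls, then builds the final object."""

    def __init__(self):
        self._fields = {}

    def app_name(self, value):
        self._fields["app_name"] = value
        return self  # returning self enables fluent chaining

    def py_files(self, value):
        self._fields["py_files"] = value
        return self

    def build(self):
        return DriverConf(**self._fields)


# Named parameters: argument order does not matter.
direct = DriverConf(py_files="a.py,b.py", app_name="pi")

# Builder: construction can be staged across separate statements.
builder = DriverConfBuilder().app_name("pi")
builder.py_files("a.py,b.py")
staged = builder.build()
```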

foxish · 2018-04-18T18:17:14Z

    }

-    val driverContainer = new ContainerBuilder(pod.container)
+    val withoutArgsDriverContainer: ContainerBuilder = new ContainerBuilder(pod.container)


The previous name seemed clearer to me.

is there a corresponding driver container with args?

Yes, look below.

But we do set arguments on this one, right? If not, please insert whitespace so I can see the difference visually.

foxish · 2018-04-18T18:20:09Z

-    } else baseFeatures
+    val maybeRoleSecretNamesStep = if (kubernetesConf.roleSecretNamesToMountPaths.nonEmpty) {
+      Some(provideSecretsStep(kubernetesConf)) } else None
+    val allFeatures: Seq[KubernetesFeatureConfigStep] =


It does not need any changes/arg passing during executor pod construction?

No, but there will be more features and I thought that doing options in the setting of allFeatures was cleaner

foxish · 2018-04-18T18:22:32Z

-      .build()

+    val driverContainer =
+      if (driverDockerContainer == "driver-py") {


Wondering if we can discover if it's a Python application in a better way here. Probably using the built up spark conf?

We can check the appResource but that was already done. I thought it would be overkill to check twice since it was already handled in setting driverDockerContainer

I think in general I'd prefer having two separate step types here. They can share some logic in either a utils class or a shared superclass. But you only apply one step type for Java apps vs one step type for Python apps.

Another way is to have the basic driver step only do work that would be strictly agnostic of python vs java, and then have a separate step for either Java or Python; the orchestrator picks which one to invoke based on the app resource type. To do this I think the step's constructor needs to take more than just the KubernetesConf as an argument - it needs to take the appropriate specifically-typed MainAppResource as an argument in the constructor as well. This breaks the convention that we've set so far but for now that's probably ok, as long as we don't get parameter length blowup as we go forward.

The second way is the approach that I envisioned and tried to implement. It seems that this approach (without putting too much work on the KubernetesConf) breaks the contract we defined, though.

So what about applications which need Python support (e.g. have Python UDFs) but don't use a Python driver process?

Think that's up to the user to make it work - I don't see this being specifically handled by the other cluster managers.

The goal of this PR should be to bring Kubernetes up to par with the other cluster managers with respect to what they provide. Do the other cluster managers provide any specific support for this?

We currently only run the Python (and, in future, R) step when we are leveraging a Python (or R) driver process. Otherwise the user would just specify the spark-py docker image, no? And then just continue to run a non-Python driver process.

Sorry, I forgot that folks could specify the driver container separately from the worker container, nvm.

@ifilonenko I think this still needs some work to clean up.

What I expect to happen is to have three step types:

BasicDriverFeatureStep, which is what's here except we don't provide the args to the container in this step anymore.

PythonDriverFeatureStep, which does what the PythonDriverFeatureStep currently does and also adds the driver-py argument.

JavaDriverFeatureStep, which only adds the argument SparkLauncher.NO_RESOURCE, conf.roleSpecificConf.appArgs, etc.

Then in the KubernetesDriverBuilder, always apply the first step, and select which of 2 or 3 to apply based on the app resource type.

Agreed. I didn't know if we wanted to include a JavaDriverFeatureStep. I will do so then.
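
The three-step layout proposed above can be sketched as follows, in Python rather than the PR's Scala. The class names mirror the ones in the comment, but everything else is assumed for illustration: the argument layouts are placeholders, and select_steps dispatches on a ".py" suffix only to keep the sketch short, whereas the comment proposes dispatching on the app resource type.

```python
class BasicDriverFeatureStep:
    """Language-agnostic driver setup; adds no container arguments."""

    def configure(self, args):
        return list(args)


class JavaDriverFeatureStep:
    def __init__(self, main_class, app_args):
        self.main_class = main_class
        self.app_args = app_args

    def configure(self, args):
        # Placeholder argument layout, assumed for illustration only.
        return args + ["driver", "--class", self.main_class] + self.app_args


class PythonDriverFeatureStep:
    def __init__(self, primary_py_file):
        self.primary_py_file = primary_py_file

    def configure(self, args):
        # "driver-py" is the argument named in the review comments.
        return args + ["driver-py", self.primary_py_file]


def select_steps(resource, main_class=None, app_args=()):
    """Always apply the basic step, then exactly one language-specific step."""
    steps = [BasicDriverFeatureStep()]
    if resource.endswith(".py"):
        steps.append(PythonDriverFeatureStep(resource))
    else:
        steps.append(JavaDriverFeatureStep(main_class, list(app_args)))
    return steps


def build_args(steps):
    """Run the steps in order, threading the argument list through each."""
    args = []
    for step in steps:
        args = step.configure(args)
    return args
```

Keeping the basic step free of arguments means neither language-specific step has to undo anything the other path would have added.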

foxish · 2018-04-18T18:23:35Z

Thanks for taking this on @ifilonenko. Left some initial comments on the PR without going too much in depth - since as you noted, it's WIP.

holdenk

I've got some initial feedback and questions but I'm really excited to see the progress. One thing which I'm a little worried about and wasn't aware of is that the integration tests appear to be living in a separate non-ASF repo? What's the story behind that and can we do anything to bring those in?

holdenk · 2018-04-20T19:49:18Z

 Options:
-  -f file     Dockerfile to build. By default builds the Dockerfile shipped with Spark.
+  -f file     Dockerfile to build for JVM based Jobs. By default builds the Dockerfile shipped with Spark.
+  -p file     Dockerfile with Python baked in. By default builds the Dockerfile shipped with Spark.


One (future) concern is how we would handle the overlay with both Python and R at the same time.

holdenk · 2018-04-20T19:53:34Z

      childMainClass = KUBERNETES_CLUSTER_SUBMIT_CLASS
      if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
-        childArgs ++= Array("--primary-java-resource", args.primaryResource)
+        if (args.isPython) {


This logic appears to be duplicated from YARN, would it make sense to factor this out into a common function?

We chatted about this offline and while it's close it's not exactly the same, so we can deal with minor parts of duplication for now.

holdenk · 2018-04-20T19:54:30Z

+  val MEMORY_OVERHEAD_FACTOR =
+    ConfigBuilder("spark.kubernetes.memoryOverheadFactor")
+      .doc("This sets the Memory Overhead Factor that will allocate memory to non-JVM jobs " +
+        "which in the case of JVM tasks will default to 0.10 and 0.40 for non-JVM jobs")


+1 to this thanks for adding this.

holdenk · 2018-04-20T19:57:42Z

+              sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_MAIN_APP_RESOURCE, res)
+              sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_APP_ARGS, appArgs.mkString(" "))
+          }
+          sparkConfWithMainAppJar.set(MEMORY_OVERHEAD_FACTOR, 0.4)


So wait, if the user has specified a different value I don't think we should override it, and it's not clear to me that this code will not override a user-specified value.

Very true, will need to ensure that it does not override the set value.
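
The override concern can be shown with a plain dictionary standing in for SparkConf. This is a sketch under that assumption, and apply_python_default is a hypothetical name: an unconditional assignment would clobber a user-supplied factor, while a set-if-missing style update only fills the gap.

```python
NON_JVM_OVERHEAD_FACTOR = 0.4
KEY = "spark.kubernetes.memoryOverheadFactor"


def apply_python_default(conf):
    """Fill in the non-JVM default only when the user left the key unset."""
    result = dict(conf)  # leave the caller's mapping untouched
    result.setdefault(KEY, NON_JVM_OVERHEAD_FACTOR)
    return result
```

With no user value the result carries 0.4; with an explicit value such as 0.25 that value is preserved.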

holdenk · 2018-04-20T20:00:01Z

-      .build()

+    val driverContainer =
+      if (driverDockerContainer == "driver-py") {


So what about applications which need Python support (e.g. have Python UDFs) but don't use a Python driver process?

holdenk · 2018-04-20T20:05:25Z

+    assert(kubernetesConfWithoutMainJar.sparkConf.get(MEMORY_OVERHEAD_FACTOR) === 0.1)
  }

+  test("Creating driver conf with a python primary file") {


Would also like to see a unit test with a PyFile and an overridden memory overhead.

Defaults are checked on 96 and 117. (But I need to ensure that it is possible to override as well. Will add.)

Just a follow-up: we should have a test with Python and an overridden MEMORY_OVERHEAD_FACTOR (e.g. a test to make sure that setIfMissing is used, since we had it the other way earlier in the PR).

holdenk · 2018-04-20T20:08:12Z

+    rm -r /usr/lib/python*/ensurepip && \
+    pip install --upgrade pip setuptools && \
+    rm -r /root/.cache
+ENV PYTHON_VERSION 2.7.13


So I think it might make sense to build the container with both 2 & 3 since the container might be built by a vendor or cluster administrator and then used by a variety of people. What do folks think?

As for figuring out the env, if we wanted to do it that way we can call the current user's python and ask it for its version information (based on the Spark Python environment variables).

holdenk · 2018-04-20T20:09:54Z

+COPY python /opt/spark/python
+RUN apk add --no-cache python && \
+    python -m ensurepip && \
+    rm -r /usr/lib/python*/ensurepip && \


Can we add a comment explaining why this part is here?

holdenk · 2018-04-20T20:10:19Z

+    python -m ensurepip && \
+    rm -r /usr/lib/python*/ensurepip && \
+    pip install --upgrade pip setuptools && \
+    rm -r /root/.cache


Is this just being done for space reasons?

holdenk · 2018-04-20T20:11:52Z

+ENV PYTHON_VERSION 2.7.13
+ENV PYSPARK_PYTHON python
+ENV PYSPARK_DRIVER_PYTHON python
+ENV PYTHONPATH ${SPARK_HOME}/python/:${SPARK_HOME}/python/lib/py4j-0.10.6-src.zip:${PYTHONPATH}


We're going to need to mention the Py4J zip file needs to be updated here as well :(
Also an open question: do we want the PySpark.zip file in here instead of python/, and/or, if we're trying to make "slim" images, do we want to delete that zip file?

holdenk · 2018-04-20T20:20:31Z

Other feedback, not directly related to the code: in the example, I would expect sort to be passed as an argument to pi, both from a quick reading of it and from using regular spark-submit in local mode, so I wouldn't expect spark-submit to not treat it as an argument. Are you just looking to add sort.py to the user's Python path so it's included as a resource? If so, I think updating the env variables or using --py-files is the way to go. If I've misunderstood that question/example though, no stress :)

And thank you so much for working on this, I'm super excited to see the progress. Sorry for only the quick first-pass review, but I figured since it's a work in progress that is what you are looking for. If you want more detailed feedback please ping me :)

mccheah · 2018-04-20T20:28:14Z

Integration tests are meant to be in this repository but we haven't gotten there yet. See #20697

ifilonenko · 2018-04-20T20:39:25Z

+      "$SPARK_HOME/bin/spark-submit"
+      --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
+      --deploy-mode client
+      "$@" $PYSPARK_PRIMARY $PYSPARK_SECONDARY


@holdenk I thought the PythonRunner takes in a comma-delimited string of PyFiles as an argument, which is why I set it to be --class PythonRunner $PYSPARK_PRIMARY $PYSPARK_FILES $PYSPARK_DRIVER_ARGS

resolved comments and fixed --pyfiles issue and allowed for python2 or python3 to be specified

SparkQA · 2018-05-02T08:15:33Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2729/

SparkQA · 2018-05-02T08:21:14Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2729/

erikerlandson · 2018-05-02T15:31:40Z

@holdenk I think your comment above gets at a use-case "ambiguity" that containerization causes. There are now at least two choices of channel for supplying dependencies: from the command line, or by customized container (and here there are at least two sub-cases: manually created customizations, or via source-to-image tooling).

When specifying deps via the command line, particularly in cluster mode, we have backed out of staging local files via init-container; does pulling from URI suffice?

ifilonenko · 2018-05-02T15:41:53Z

@shaneknapp @ssuchter integration tests seem to be failing not due to this PR, but in general. Please investigate, because this PR does pass integration tests + an extra PySpark test.

Error:

Error starting host: Temporary Error: Error configuring auth on host: Temporary Error: ssh command error:
command : sudo systemctl -f restart docker

ifilonenko · 2018-05-02T16:12:55Z

retest this please

SparkQA · 2018-05-02T16:30:39Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2753/

SparkQA · 2018-05-02T16:36:10Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/2753/

holdenk · 2018-05-04T22:41:02Z

@erikerlandson I think pulling from URI is fine for now. The actual comment was just focused on the usage of spark-submit in that case, but I agree longer term we should think about dependencies, especially things which can't just be shipped as zip or pyfiles (but I think that is vNext).

holdenk

Thanks for working on this! Really excited to see this move from WIP to something closer, and I hope we can get this in soon. I have some more questions and feedback, let me know if you need anything else from me.

holdenk · 2018-05-04T22:43:19Z

      childMainClass = KUBERNETES_CLUSTER_SUBMIT_CLASS
      if (args.primaryResource != SparkLauncher.NO_RESOURCE) {
-        childArgs ++= Array("--primary-java-resource", args.primaryResource)
+        if (args.isPython) {


We chatted about this offline and while it's close it's not exactly the same, so we can deal with minor parts of duplication for now.

holdenk · 2018-05-04T22:55:19Z

+            case PythonMainAppResource(res) =>
+              additionalFiles += res
+              maybePyFiles.foreach{maybePyFiles =>
+                additionalFiles.appendAll(maybePyFiles.split(","))}


Not for this PR or JIRA, but for later maybe we should normalize our parsing of input files in a way which allows escape characters and share the logic between Yarn/K8s/Mesos/standalone. What do y'all think? Possible follow up JIRA: https://issues.apache.org/jira/browse/SPARK-24184
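
As a rough picture of what escape-aware parsing might look like, here is a small sketch. It is not taken from Spark or from the linked JIRA: split_escaped is a hypothetical helper, and the choice of backslash as the escape character is an assumption.

```python
def split_escaped(value, sep=",", escape="\\"):
    """Split on sep, treating escape+char as a literal char (hypothetical)."""
    parts, current = [], []
    chars = iter(value)
    for ch in chars:
        if ch == escape:
            # Keep the next character literally; a trailing escape is kept as-is.
            current.append(next(chars, escape))
        elif ch == sep:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts
```

A plain split(",") would cut a file name containing a comma in two, which is the case a shared parser across YARN/K8s/Mesos/standalone would need to handle consistently.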

holdenk · 2018-05-04T23:04:04Z

+              sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_MAIN_APP_RESOURCE, res)
+              sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_APP_ARGS, appArgs.mkString(" "))
+          }
+          sparkConfWithMainAppJar.setIfMissing(MEMORY_OVERHEAD_FACTOR, 0.4)


Do we want to set this in the JVM case?

This is set later in BaseDriverStep

holdenk · 2018-05-04T23:05:29Z

    }

-    val driverContainer = new ContainerBuilder(pod.container)
+    val withoutArgsDriverContainer: ContainerBuilder = new ContainerBuilder(pod.container)


But we do set arguments on this one, right? If not, please insert whitespace so I can see the difference visually.

holdenk · 2018-05-04T23:08:31Z

-      .build()

+    val driverContainer =
+      if (driverDockerContainer == "driver-py") {


Sorry, I forgot that folks could specify the driver container separately from the worker container, nvm.

holdenk · 2018-05-04T23:10:45Z

+    require(mainResource.isDefined, "PySpark Main Resource must be defined")
+    val otherPyFiles = kubernetesConf.pyFiles().map(pyFile =>
+      KubernetesUtils.resolveFileUrisAndPath(pyFile.split(","))
+        .mkString(":")).getOrElse("")


Leave a comment that we are switching from "," to ":" to match the format expected by the PYTHONPATH environment variable. ( http://xkcd.com/1987 )
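
The delimiter switch being asked about can be sketched on its own. This leaves out the URI resolution that KubernetesUtils.resolveFileUrisAndPath performs in the diff, and to_pythonpath is a hypothetical name.

```python
def to_pythonpath(py_files):
    """Turn a comma-separated --py-files value into a PYTHONPATH-style string."""
    if not py_files:
        return ""
    # spark-submit separates files with ","; PYTHONPATH entries are joined with ":".
    return ":".join(p.strip() for p in py_files.split(",") if p.strip())
```

The ":" separator is the POSIX convention for PYTHONPATH, which matches the Linux containers this PR targets.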

holdenk · 2018-05-04T23:13:43Z

+        .endEnv()
+      .addNewEnv()
+        .withName(ENV_PYSPARK_FILES)
+        .withValue(if (otherPyFiles == "") {""} else otherPyFiles)


wait, what is this logic?

Don't add empty env vars - see above.
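
One way to read that suggestion, sketched with plain dictionaries instead of the fabric8 container builder used in the diff (build_env is a hypothetical helper and the values are examples only):

```python
def build_env(candidates):
    """Return name/value entries, skipping variables with no value."""
    return [
        {"name": name, "value": value}
        for name, value in candidates
        if value  # drops None and ""
    ]


env = build_env([
    ("PYSPARK_PRIMARY", "local:///opt/spark/examples/pi.py"),  # example value
    ("PYSPARK_FILES", ""),  # empty, so it is omitted
])
```

Filtering before the entries are built avoids the `if (otherPyFiles == "") {""} else otherPyFiles` branch, which returns the same value either way.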

holdenk · 2018-05-04T23:15:43Z

      MAIN_CLASS,
-      APP_ARGS)
+      APP_ARGS,
+      None)


Still want names.

SparkQA · 2018-06-01T15:04:01Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3623/

SparkQA · 2018-06-01T15:09:48Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3623/

holdenk

Awesome, excited to see this moving forward. Really looking forward to seeing improved integration tests in apache-spark-on-k8s/spark-integration#46.

holdenk · 2018-06-01T16:29:51Z

+                additionalFiles.appendAll(maybePyFiles.split(","))}
+              sparkConfWithMainAppJar.set(KUBERNETES_PYSPARK_MAIN_APP_RESOURCE, res)
+          }
+          sparkConfWithMainAppJar.set(MEMORY_OVERHEAD_FACTOR, 0.4)


Yup, you can see my statement about not overriding the explicitly user-provided value in my comment on the 20th ("if the user has specified a different value don't think we should override it").

So this logic, as it stands, is K8s specific and I don't think we can change how YARN chooses its memory overhead in a minor release, so I'd expect this to remain K8s specific until at least 3.0, when we can evaluate if we want to change this in YARN as well.

The memory overhead configuration is documented on the YARN page right now (see spark.yarn.am.memoryOverhead on http://spark.apache.org/docs/latest/running-on-yarn.html). So I would document this in http://spark.apache.org/docs/latest/running-on-kubernetes.html#spark-properties (i.e. ./docs/running-on-kubernetes.md).

As for being intuitive, I'd argue that this actually is more intuitive than what we do in YARN: we know that users who run R & Python need more non-JVM heap space, and many users don't know to think about this until their job fails. We can take advantage of our knowledge to handle this setting for the user more often. You can see how often this confuses folks on the list, docs, and Stack Overflow by looking at "memory overhead exceeded" and "Container killed by YARN for exceeding memory limits" and similar.

SparkQA · 2018-06-01T18:20:23Z

Test build #91394 has finished for PR 21092 at commit 24a704e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kokes · 2018-06-07T06:41:26Z

+      .stringConf
+      .checkValue(pv => List("2", "3").contains(pv),
+        "Ensure that Python Version is either Python2 or Python3")
+      .createWithDefault("2")


Am I reading this right that the default is Python 2? Is there a reason for that? Thanks!

No particular reason. I just thought that the major version should default to 2.

There is only ~18 months of support left for Python 2. Python 3 has been around for 10 years and unless there’s a good reason, I think it should be the default.

I am willing to do that: thoughts @holdenk ?

I'm fine with either as the default. While Py2 is officially EOL I think we'll still see PySpark Py2 apps for a while after.
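
For readers unfamiliar with the ConfigBuilder check in the diff, the same validation can be written out in Python. This is only a sketch: validate_python_version is a hypothetical function, and the default of "2" simply mirrors createWithDefault("2") as it stood at this point in the review.

```python
ALLOWED_MAJOR_VERSIONS = ("2", "3")


def validate_python_version(value="2"):
    """Accept only "2" or "3", mirroring the checkValue predicate above."""
    if value not in ALLOWED_MAJOR_VERSIONS:
        raise ValueError("Ensure that Python Version is either Python2 or Python3")
    return value
```

Note that the setting is a major version only, so a value such as "3.6" is rejected rather than being rounded down.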

SparkQA · 2018-06-07T17:10:04Z

Test build #91530 has finished for PR 21092 at commit 6a6d69d.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-06-07T17:29:19Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3692/

SparkQA · 2018-06-07T17:35:12Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3692/

SparkQA · 2018-06-07T22:28:28Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3694/

SparkQA · 2018-06-07T22:34:17Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3694/

ifilonenko · 2018-06-07T22:57:59Z

KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run SparkPi with a test secret mounted into the driver and executor pods
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
Run completed in 4 minutes, 28 seconds.
Total number of tests run: 11
Suites: completed 2, aborted 0
Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 05:24 min
[INFO] Finished at: 2018-06-07T18:54:42-04:00
[INFO] Final Memory: 21M/509M
[INFO] ------------------------------------------------------------------------

For new addition to: apache-spark-on-k8s/spark-integration#46

SparkQA · 2018-06-08T02:33:01Z

Test build #91537 has finished for PR 21092 at commit ab92913.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk

This is super close, thank you for the integration tests, really great work. Just a small improvement in the docs and one small unit test is what I see left. We should make sure other folks have a chance for any last comments, but hopefully we can merge this next week unless something surprising comes up :)

holdenk · 2018-06-08T16:17:54Z

+    assert(kubernetesConfWithoutMainJar.sparkConf.get(MEMORY_OVERHEAD_FACTOR) === 0.1)
  }

+  test("Creating driver conf with a python primary file") {


Just a follow-up: we should have a test with Python and an overridden MEMORY_OVERHEAD_FACTOR (e.g. a test to make sure that setIfMissing is used, since we had it the other way earlier in the PR).

holdenk · 2018-06-08T16:28:06Z

+  <td><code>spark.kubernetes.memoryOverheadFactor</code></td>
+  <td><code>0.1</code></td>
+  <td>
+    This sets the Memory Overhead Factor that will allocate memory to non-JVM jobs which in the case of JVM tasks will default to 0.10 and 0.40 for non-JVM jobs.


I think we can maybe improve this documentation a little bit. It's not so much how much memory is set aside for non-JVM jobs, it's how much memory is set aside for non-JVM memory, including off-heap allocations, non-JVM jobs (like Python or R), and system processes.
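
To make the factor concrete, here is a small worked example of how such a factor might translate into megabytes. The formula and the 384 MiB floor are assumptions for illustration and are not stated in this thread, so check the Spark version in use; memory_overhead_mib is a hypothetical name.

```python
MIN_OVERHEAD_MIB = 384  # assumed floor; not taken from this thread


def memory_overhead_mib(container_memory_mib, factor):
    """Overhead = factor * memory, but never below the assumed minimum."""
    return max(int(container_memory_mib * factor), MIN_OVERHEAD_MIB)
```

Under these assumptions a 4096 MiB container gets 409 MiB of overhead at the JVM factor of 0.1 and 1638 MiB at the non-JVM factor of 0.4, while a 1024 MiB container at 0.1 falls back to the 384 MiB floor.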

holdenk · 2018-06-08T16:32:28Z

+      Some(inputPyFiles.mkString(",")))
+    assert(kubernetesConfWithMainResource.sparkConf.get("spark.jars").split(",")
+      === Array("local:///opt/spark/jar1.jar"))
+    assert(kubernetesConfWithMainResource.sparkConf.get(MEMORY_OVERHEAD_FACTOR) === 0.4)


Just as we discussed earlier, testing this value explicitly configured with Python would be good to have as well.

holdenk

LGTM pending Jenkins and sign-off from someone with K8s background.

SparkQA · 2018-06-08T17:09:59Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3712/

SparkQA · 2018-06-08T17:15:34Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-spark-integration/3712/

mccheah

LGTM, will merge to master.

felixcheung · 2018-06-08T20:22:11Z

awesome!

SparkQA · 2018-06-08T21:06:48Z

Test build #91573 has finished for PR 21092 at commit a61d897.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

lucashu1 · 2018-06-14T05:12:37Z

Sorry in advance if this is the wrong place to be asking this!

Does this PR mean that we'll be able to create SparkContexts using PySpark's SparkSession.Builder with master set to k8s://<...>:<...>, and have the resulting jobs run on spark-on-k8s, instead of on local/standalone?

E.g.:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master('k8s://https://kubernetes:443').getOrCreate()

I'm trying to use PySpark in a Jupyter notebook that's running inside a Kubernetes pod, and have it use spark-on-k8s instead of resorting to using local[*] as master.

Till now, I've been getting an error saying that:

Error: Python applications are currently not supported for Kubernetes.

whenever I try to use k8s://<...> as master.

Thanks!

UPDATE: Stack Overflow question here in case anyone has an answer!

felixcheung · 2018-06-14T06:08:59Z

@lucashu1 please send your question to stackoverflow or user@spark.apache.org!

HyukjinKwon · 2020-12-10T15:02:54Z

+      .createWithDefault(0.1)
+
+  val PYSPARK_MAJOR_PYTHON_VERSION =
+    ConfigBuilder("spark.kubernetes.pyspark.pythonversion")


Sorry for leaving a comment in an ancient PR but I couldn't hold it. Why did we add a configuration to control the Python version instead of using the existing PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON?

Doing this in a configuration breaks or disables many things, for example PEX (https://medium.com/criteo-labs/packaging-code-with-pex-a-pyspark-example-9057f9f144f3), which requires setting PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON manually.

cc @dongjoon-hyun too FYI. Conda / virtualenv support enabled by #30486 wouldn't work in Kubernetes because of this.

@HyukjinKwon sounds reasonable to include support for that, we just need to agree on a policy for which takes precedence.

ifilonenko and others added 5 commits April 15, 2018 23:23

initial architecture for PySpark w/o dockerfile work

fb5b9ed

included entrypoint logic

b7b3db0

satisfying integration tests

98cef8c

end-to-end working pyspark

dc670dc

Merge pull request #1 from ifilonenko/py-spark

eabe4b9

Initial architecture for PySpark w/o dependency management

foxish reviewed Apr 18, 2018

This was referenced Apr 18, 2018

Python support kubeflow/spark-operator#124

Closed

Sync with apache/spark/master kubeflow/spark-operator#129

Closed

holdenk reviewed Apr 20, 2018

ifilonenko commented Apr 20, 2018

ifilonenko and others added 3 commits May 2, 2018 03:56

resolved comments and fixed --pyfiles issue and allowed for python2 or python3 to be specified

8d3debb

Merge pull request #2 from ifilonenko/py-spark

91e2a2c

resolved comments and fixed --pyfiles issue and allowed for python2 o…

Merge branch 'master' of https://github.com/ifilonenko/spark

5761ee8

ifilonenko changed the title ~~[SPARK-23984][K8S][WIP] Initial Python Bindings for PySpark on K8s~~ [SPARK-23984][K8S] Initial Python Bindings for PySpark on K8s May 2, 2018

holdenk reviewed May 4, 2018

holdenk reviewed Jun 1, 2018

kokes reviewed Jun 7, 2018

kokes mentioned this pull request Jun 7, 2018

[SPARK-13587] [PYSPARK] Support virtualenv in pyspark #13599

Closed

added e2e tests on --py-files and inclusion of docs on config values

6a6d69d

style issues

ab92913

holdenk reviewed Jun 8, 2018

resolve comments on docs and addition of unit test

a61d897

holdenk reviewed Jun 8, 2018

mccheah approved these changes Jun 8, 2018

asfgit closed this in 1a644af Jun 8, 2018

liyinan926 mentioned this pull request Jun 11, 2018

Make the operator work for PySpark in spark master kubeflow/spark-operator#181

Closed

2 tasks

rayburgemeestre mentioned this pull request Jun 27, 2018

[SPARK-23146][WIP] Support client mode for Kubernetes in Out-Cluster mode #20451

Closed

HyukjinKwon reviewed Dec 10, 2020
